The use of phase-only representations of speech for
isolated word recognition is explored. Until recently the 
ear was thought to be short-term phase insensitive. 

However, short-term phase-only reconstructed speech has been
shown to retain much of the intelligibility of the original
signal. Using cepstral and analytic-signal processing
techniques, a system for isolated word recognition is 
developed. The results of tests for both the speaker- 
dependent and speaker-independent case indicate that phase 
may be an important feature to consider in the development 

I. INTRODUCTION


As the complexity of man's machines increases, so does
the need for simple, efficient man-machine interfaces. 
Automatic speech recognition plays a major role in this man- 
machine communication because of the superiority of speech 
over other modes of human communication. Speech is the most 
familiar and most convenient way for humans to communicate. 
Voice input leaves the hands and eyes of the operator free 
to perform other tasks and allows speaker mobility. 

Word recognition is one facet of the research conducted 
in the area of speech processing. Speech processing can be 
divided into three major categories. The speech analysis 
area includes word recognition, speaker identification, and 
speaker verification. The second category is speech 
synthesis. An example of synthesis is a data-retrieval 
system, where the computer responds verbally when its data 
base is interrogated. Another example is when a child
receives a verbal response from his toy informing him he has 
correctly answered a question. The third area is a 
combination of the first two, speech analysis followed by 
speech synthesis. This has application in secure voice 
transmission and speech data rate reduction. As an example
of the latter, the telephone company requires 64K bits/sec
to transmit speech. The Department of Defense standard for
data rate reduction is 2.4K bits/sec. The Air Force is
experimenting with data rates as low as 150 bits/sec which
still provide intelligible speech.

The advent of the general purpose digital computer
in the mid-1960s provided speech researchers with a
powerful tool. Numerous speech processing algorithms
using digital signal processing techniques have been
developed for both analysis and synthesis. From using dynamic
programming to time-warp speech prior to processing, to
algorithms for extracting parameters to be used for speech
synthesis, speech processing is a billion dollar a year
business.

Various speaker-dependent word recognition systems are 
commercially available. These systems generally perform 
some type of spectral analysis on the incoming speech 
signal. The recognition process involves classical pattern
recognition techniques. These systems have a very high rate
of successful recognition. 

The success of these systems notwithstanding, the 
problem of constructing a speaker-independent recognition 
system remains unsolved. The solution to this problem 
involves determining what features of speech contain the 
information and hence are speaker independent. Before one
can talk about extracting the information content from the
speech signal, a look at a model of how humans produce
speech is in order.


A. FUNDAMENTALS OF SPEECH

Flanagan [Refs. 1 and 2] formulated a generally accepted
model for human speech production. His model describes the
vocal tract as a nonuniform acoustic tube connecting the
vocal cords and the lips. In an adult male the vocal tract
is approximately 17 cm in length.

The vocal tract can be connected to an ancillary cavity
called the nasal cavity. The coupling is accomplished
through a trapdoor mechanism called the velum. The nasal
cavity begins at the velum and terminates at the nostrils.
In an adult it is about 12 cm long. When non-nasal sounds
are produced the velum closes, thereby sealing off the nasal
cavity.

Humans are capable of producing two types of sounds, 
voiced and unvoiced. In the case of voiced sounds air moves 
over the vocal cords causing them to vibrate in a quasi- 
periodic fashion. Unvoiced sounds are generated by either
forming a constriction in the tract and forcing the air
through at high velocity or by allowing pressure to build up
behind the closure and then releasing it suddenly. The name
fricative is associated with the former while plosive is the
name given to the latter.





Since the physical configuration of the vocal tract
changes with time, Flanagan's model can be represented as a
linear time-varying system as shown in Figure 1.1.

[Block diagram: excitation x(t) -> Time Varying Filter v(t)
-> output y(t)]

Figure 1.1. Model of Speech Production


If it is assumed that the vocal tract changes slowly 
with time the output can be approximated by the short-term 
convolution of the excitation, x(t), and the vocal tract 
impulse response, v(t). For voiced sounds x(t) is 
quasiperiodic hence the output y(t) is also quasiperiodic. 
For the unvoiced case the excitation x(t) is random and is 
generally approximated by white noise. 
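
The short-term convolution model above can be sketched numerically. The following is only an illustration under stated assumptions: the sampling rate, the pitch period of 80 samples, and the single damped resonance used for v(t) are hypothetical values chosen for the example, and the quasi-periodic excitation is idealized as an exactly periodic impulse train.

```python
import numpy as np

def synthesize(v, n_samples, voiced, pitch_period=80, seed=0):
    """Short-term convolution y = x * v for a fixed vocal tract response v."""
    if voiced:
        # Quasi-periodic excitation idealized as a periodic impulse train.
        x = np.zeros(n_samples)
        x[::pitch_period] = 1.0
    else:
        # Unvoiced excitation approximated by white noise.
        x = np.random.default_rng(seed).standard_normal(n_samples)
    # Truncate the full convolution to the excitation length.
    return np.convolve(x, v)[:n_samples]

# Hypothetical vocal tract impulse response: one damped resonance.
fs = 8000
t = np.arange(200) / fs
v = np.exp(-300.0 * t) * np.sin(2 * np.pi * 700.0 * t)

y_voiced = synthesize(v, 800, voiced=True)
y_unvoiced = synthesize(v, 800, voiced=False)
```

Once the initial transient has passed, the voiced output repeats every pitch period, while the unvoiced output does not.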

If the vocal tract impulse response of an individual
could be obtained, then, using the time varying linear
system model, it should be possible to generate
intelligible speech. The excitation would be either
periodic or random depending on whether voiced or unvoiced
sounds are desired. Figure 1.2 is a simplified speech
synthesis machine where the vocal tract parameters are
stored in the RAM and downloaded to the voice synthesis
chip, which is excited by either the periodic or the random
signal. This type of speech synthesis arrangement is the
basis for Texas Instruments' (TI) Speak and Spell toys. TI
can custom manufacture a speech synthesis chip which will
emulate anyone's voice for $15,000.

[Block diagram: a controller selects either the periodic
source E(t) or the random source N(t) to excite the voice
synthesis chip, which outputs speech]

Figure 1.2. Voice Synthesis

These voiced and unvoiced sounds are combined in a
unique fashion to form phonemes, the basic building blocks 
of language. All languages can be reduced to a finite 
number of these distinguishable building blocks. Phonemes 
are of such fundamental importance that if one phoneme is 
exchanged for another the meaning of an utterance is 
completely altered. 

Thus, in theory, if a machine could be designed to
disassemble utterances into their phoneme components the
speech recognition problem would be completely solved.
Despite vast amounts of time, effort, and money expended,
however, the phoneme disassembler appears to be years away
from becoming a reality.


B. SPEECH RECOGNITION MACHINES

While the phoneme disassembler does not exist, several 
types of speech recognition systems are commercially 
available. The majority of these systems are classified as 
isolated word recognizers. As the name implies the systems 
are designed to recognize isolated words. The vocabulary of 
these machines is usually limited to 100-300 words and these 
systems are extremely speaker dependent. Thus, a person
desiring to use these machines must first train the machine 
to recognize his voice. During the training phase the 
speaker's utterances are processed and templates formed. 


The recognition process involves comparing the incoming
utterance with those templates stored in the machine's
memory [Ref. 3]. Although these machines have a limited
vocabulary and cannot recognize connected or conversational
speech, they are extremely useful for inventory control,
quality assurance control, or for a pilot to check the
systems in a combat aircraft. In all these instances the
vocabulary is limited, the speaker is known, and voice data
entry frees the individual to perform other tasks.

ITT has developed a word recognition system for the Air 
Force's F-16 fighter. The system is capable of recognizing 
300 words and allows the pilot to check the status of 
certain systems while he maintains two-hand control of the
plane. This two-hand control is particularly important
during low level, high speed attack runs. The pilots
update their voice patterns monthly or if their voice
changes due, say, to a cold. The patterns are stored in a
bubble memory and inserted into the system prior to 
take-off. The microphone is located inside the pilot's 
oxygen mask and the system status is displayed on the 
cockpit's CRT. At a recent demonstration of this system it 
had a correct recognition rate of 99%. 

The NPS Speech Processing Laboratory acquired an isolated
word recognition system for experimentation purposes.
The system is the VRM Voterm-2 manufactured by Interstate
Electronics Corporation. The system, acquired in 1981,
weighs 10 lbs. and cost $2500. Today the same system has
been reduced to a four chip set, for a cost of $1000.

The operation of the VRM is typical of the word 
recognition systems currently available [Ref. 4]. It allows 
the user to select the vocabulary size, decision threshold 
and number of training passes. It also allows for reference 
pattern transfer between itself and the host computer. The 
host computer serves only as a mass storage device and 
controller. All processing and recognition is performed 
real-time by the VRM. 

The input speech signal is analyzed by a 16-filter 
analog spectrum analyzer and then passed through an A/D 
converter. This digitized speech data is then converted to 
a fixed-size (120 bit) pattern that preserves the
information content of the utterance. During the training
phase the VRM rejects utterances that do not sufficiently
agree with previous training samples of the word. This
rejection leads to a reduction of the number of 'ones'
stored in the pattern. After seven training passes the
pattern contains approximately one hundred 'zeroes'.

In 1980, NATO and the Rome Air Development Center (RADC) 
[Ref. 5] conducted a comparison test on three isolated word 
recognition systems. The vocabulary used consisted of the 
ten single digits of the respective languages of the
speakers. The machines evaluated were the VRM system, the
Threshold Technology 8040 Preprocessor (cost $50,000) and
the Nippon Electric DP-100 (cost $60,000).

Table 1.1 lists the results from the RADC test [Ref. 6]. 
Each speaker trained the machines by repeating each digit 
ten times. No attempt was made to introduce speakers who 
had not trained the machine. However, in tests run at the
Speech Processing Lab on the VRM with some non-trained
speakers, using the ten digits and three sets of reference
patterns, the successful recognition rate for new speakers
was less than 30%.

Thus, these systems work extremely well for what they 
were designed to accomplish. As previously stated, the 
basic question of what parameters of speech are speaker 
independent still remains unanswered. Numerous theories 
have been proposed and all have been unsuccessful. There is 
a lack of understanding of the human mechanisms used in
understanding speech.


Table 1.1. Recognition Percentages for RADC/NATO Test

[Table columns: language spoken (English, French, Dutch,
All, All/Male, All/Female); native or non-native; number of
speakers; and, for each of the VRM, Threshold, and NEC
systems, the number of utterances and the recognition
percentage.]





II. MODELS OF THE EAR


For a long time people have been trying to understand 
how the human ear functions. In the first century B.C., the
Roman poet and philosopher Lucretius postulated a model
"involving little grains of sand in the inner ear responding
to different tones" [Ref. 7]. The 18th century Italian
violinist Tartini noted that the ear produced a third tone 
from two tones played simultaneously. Thus the long held 
belief that the ear was a linear device was demonstrated to 
be false. Today the ear is thought to be a nonlinear device 
even at power levels near the threshold of hearing. 

The first concentrated research into the process of 
hearing did not begin until the mid-1800's. This was the 
time of Seebeck, Helmholtz, and Ohm. It was Ohm who 
postulated a now famous law on the relationship of speech 
and its phase angle. He stated that all the information
content of speech is contained in its power spectrum and is
independent of the phase angle of the components. Although
Ohm's law has been modified in recent years, it remains as 
one of the fundamental laws of psychoacoustics. 

The ear can be broken down into three physical areas:
the outer, middle, and inner ear. Sound waves impinge on the
outer ear and are conducted down a canal until they reach
the middle ear. The middle ear contains three tiny bones.
The alternate compressions and rarefactions of the speech
wave cause the eardrum to strike the bones. In the inner
ear the wave travels along a thin membrane whose frequency
response varies between 100 Hz and 20 KHz. This provides
for spectral analysis of the incoming signal.

The membrane of the inner ear is lined with tiny hairs. 
It is these hairs, or more correctly groups of hairs, that
perform the spectral analysis. Recent studies at the
California Institute of Technology [Ref. 8] have found that
each tiny hair bundle consists of 30-150 thin, rod-shaped
extensions called cilia. These hair bundles are attached to
hair cells. The hair cells are very sensitive transducers
which convert the movement of the hair bundle into an
electrical signal which is sent to the brain. The hair
bundle-hair cell combination forms a sort of mechanical
spectrum analyzer.

Manfred Schroeder [Ref. 9] describes an experiment in 
which the inner ear's sensitivity to phase was demonstrated. 
The experiment was as follows: 

1) A 100 sec. sample of speech was Fourier transformed. 


2) Random phase angles were assigned to the frequency
components (assuming a uniform distribution from 0 to 2π).


3) The inverse Fourier transform was taken. 

The resultant signal sounded like white noise. Thus by
randomizing the phase angles the signal was transformed from
intelligible speech to noise. This lent credence to the
hypothesis that the inner ear was phase sensitive and that
Ohm's law, if not wrong, was at least in need of
modification. The experiment was repeated, this time using a
50 msec. sample of speech. The resultant signal was
non-intelligible noise. Ohm's law, modified to say that only
the short term amplitude spectrum contained the speech
information, appeared to be correct.
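
The three steps of the phase randomization experiment can be sketched as follows. This is a minimal illustration, not the procedure of [Ref. 9]: the decaying sinusoid is only a stand-in for a speech sample, and the function name `randomize_phase` is hypothetical. The real-input transform is used so that the output stays real; for that reason the DC and Nyquist phases are held at zero.

```python
import numpy as np

def randomize_phase(x, rng):
    """Keep the magnitude spectrum of real x, replace its phase at random."""
    X = np.fft.rfft(x)                                  # step 1: transform
    phase = rng.uniform(0.0, 2.0 * np.pi, size=X.shape) # step 2: random phase
    # DC (and Nyquist, for even lengths) must stay real for a real output.
    phase[0] = 0.0
    if len(x) % 2 == 0:
        phase[-1] = 0.0
    # step 3: inverse transform
    return np.fft.irfft(np.abs(X) * np.exp(1j * phase), n=len(x))

rng = np.random.default_rng(0)
n = np.arange(4000)
x = np.exp(-n / 1500.0) * np.sin(2 * np.pi * 0.03 * n)  # stand-in signal
y = randomize_phase(x, rng)
```

The output has the same amplitude spectrum as the input but a different waveform, which is the condition under which Ohm's law predicts no audible change.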

Ohm based his law on a model of the ear that said: 


1) The ear has a tuned bandpass filter covering the 
audio range. 


2) Only the output amplitude of each filter is sent to
the brain.


Today the most likely candidates for the bandpass filters are
the hair bundle-hair cell combinations that respond to only
selected stimuli.

In 1947 an experiment was conducted [Ref. 10] in an
effort to obtain a definite answer to the phase sensitivity
question. An AM signal at 2000 Hz was modulated by a 100 Hz
signal. Thus three frequency components (1900 Hz, 2000 Hz,
2100 Hz) were present. One of the sidebands had its phase
shifted by 180°. This phase shift resulted in what was
termed a quasi-FM (QFM) signal. Upon listening to the
signals there was a noticeable difference between the AM and
QFM signal. Thus there was a revived interest in the ear's
capability to discern waveforms and not just their
amplitude.
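
The AM and QFM signals can be compared numerically. The sketch below is an illustration only: the sampling rate, the one-second duration, and the modulation index m = 1 are assumptions not given in the text, and the envelope is taken as the magnitude of an FFT-based analytic signal.

```python
import numpy as np

fs, n = 8000, 8000              # assumed: 1 s at 8 kHz, so each tone sits on an FFT bin
t = np.arange(n) / fs
m = 1.0                         # assumed modulation index

carrier = np.cos(2 * np.pi * 2000 * t)
lower = 0.5 * m * np.cos(2 * np.pi * 1900 * t)
upper = 0.5 * m * np.cos(2 * np.pi * 2100 * t)

am = carrier + lower + upper
qfm = carrier - lower + upper   # lower sideband shifted by 180 degrees

def envelope(x):
    """Magnitude of the analytic signal (negative frequencies zeroed)."""
    X = np.fft.fft(x)
    X[1:len(x) // 2] *= 2.0
    X[len(x) // 2 + 1:] = 0.0
    return np.abs(np.fft.ifft(X))

same_spectrum = np.allclose(np.abs(np.fft.rfft(am)),
                            np.abs(np.fft.rfft(qfm)), atol=1e-6)
am_swing = np.ptp(envelope(am))     # envelope swings between 0 and 2
qfm_swing = np.ptp(envelope(qfm))   # envelope stays between 1 and sqrt(2)
```

Both signals have identical amplitude spectra, yet their envelopes differ markedly, so any audible difference between them must come from phase.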

In a further effort to determine to what extent phase is
important in discerning speech, Hall and Schroeder [Ref. 11]
conducted an experiment where the phase angle of one of two
pure tones was changed. Specifically, two tones, one at 200
Hz and 0° and another at 400 Hz but with phase angles of 0°,
60°, 120°, 180°, 240°, and 300°, were listened to three
signals at a time. The listeners' task was to determine
which two signals sounded most alike and which two sounded
least alike. The results showed that those 400 Hz harmonics
whose phase angles differed the least were consistently
judged to be the most similar.

About twelve years prior to this experiment researchers 
at Bell Labs postulated that the phase dependency seen in 
experiments involving the inner ear could be traced to the 
phase dependence of the inner and middle ear distortion 
products. Due to the presence of these nonlinear distortion
products a new spectrum, called the inner spectrum, was
formed in the inner ear. It is this spectrum that is
analyzed by the hair bundles of the inner ear.

This theory certainly would explain what happened at
Bell Labs during a 1958 experiment [Ref. 12]. When the
phase of one of 31 equal-amplitude harmonics, all at 0°
phase, was changed to 180°, a pure tone was heard. This tone
was not heard when the signal was put through a loudspeaker.
Thus, using the inner spectrum theory, changing the phase of
one harmonic to 180° altered the amplitude of one of the
distortion products. This altered the inner spectrum,
causing a bump in the spectrum where previously it had been
flat.

In Germany, Terhardt and Fastl [Ref. 13] conducted
experiments trying to connect frequency difference and phase
angles. They formed a signal

$$s(t) = a_1 \cos(2\pi f_1 t) + a_2 \cos(2\pi f_2 t - \phi_2)$$

where $f_1$ = 200 Hz and $f_2$ = 400 Hz, and asked listeners
to adjust the amplitude of each component so the
400 Hz tone was just audible. This was to be done while the
phase angle, $\phi_2$, of the 400 Hz tone was changed. The
results showed that when $\phi_2$ was changed to 180° the
amplitude of the 400 Hz signal had to be increased by 12 dB
to remain audible.

Yet another theory on the functioning of the ear came out
of this experiment. The researchers theorized that the hair
cells of the ear were discerning the time between successive
spikes in the waveform and passed this information to the
brain. This appeared to be a reasonable explanation, since
before the phase change the time between successive spikes
was 2.5 msec. With $\phi_2$ = 180° the time between spikes
was 5 msec., unless the amplitude of the 400 Hz tone was
increased by a considerable amount. With the amplitude
increased the small spikes at the 2.5 msec. mark would
increase dramatically.

This theory is consistent with the physiology of the
ear. All the electric pulses transmitted to the brain from 
the hair cells have approximately the same amplitude, thus 
the timing between the pulses is the information that they 
carry. 


From the myriad of theories presented it is easy to
conclude that a definitive model of the human ear is
nonexistent. The fact that phase contains some information
content has been demonstrated. Whether phase alone is the
speaker independent feature that researchers are looking for
remains an unanswered question. Experiments conducted in
the late 1970's and 1980's using phase-only representations
of speech have given some credibility to the hypothesis
that phase must be included as one of the speaker
independent features of speech.

III. PHASE ONLY REPRESENTATIONS OF SPEECH


Recapitulating, Ohm's law stated that all the 
information content of speech could be obtained from the 
short term power spectrum and that phase angle of the 
components was meaningless. Thus, in the short term the ear 
is phase deaf. Oppenheim [Ref. 14] sought to explore more 
fully the importance of phase in speech. 


Given the Fourier transform of a speech signal

$$F(\omega) = |F(\omega)|\, e^{j\phi(\omega)} \qquad (3.1)$$

and if $|F(\omega)|$ is set equal to one, the inverse transform
of $e^{j\phi(\omega)}$ is a phase only representation of the
speech.

This phase only representation retained total
intelligibility, while exhibiting the characteristics of
being high pass filtered and having white noise added. The
magnitude only representation was speech-like in its
appearance but was not intelligible.
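
The two representations can be sketched with a discrete Fourier transform. This is an illustration only: the function names are hypothetical, a random sequence stands in for a speech segment, and a single transform over the whole segment is an assumption of the sketch.

```python
import numpy as np

def phase_only(x):
    """Inverse transform of exp(j*phi(w)): magnitude forced to one."""
    F = np.fft.fft(x)
    return np.real(np.fft.ifft(np.exp(1j * np.angle(F))))

def magnitude_only(x):
    """Inverse transform of |F(w)|: phase forced to zero."""
    F = np.fft.fft(x)
    return np.real(np.fft.ifft(np.abs(F)))

x = np.random.default_rng(0).standard_normal(512)  # stand-in for speech
p = phase_only(x)
q = magnitude_only(x)
```

The spectrum of `p` has unit magnitude at every frequency and the original phase, while the spectrum of `q` is real and equal to the original magnitude.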

Oppenheim concluded that transforming a signal to its
phase only form was equivalent to passing it through a
spectral whitening process with a filter whose response is
$H(\omega) = 1/|F(\omega)|$, where $F(\omega)$ is the Fourier
transform of the original signal. This spectral whitening
did not destroy the intelligibility of the speech.





Contrary to Ohm's law, Cox and Robinson [Ref. 15]
conducted a series of four experiments which preserved the
short term phase of a speech signal while either destroying
or severely distorting the amplitude. These phase-only
signals were found to retain many speech characteristics and
were intelligible to the listeners. Hence under certain
transformations short term phase may be one of the physical
invariants of speech.

The experiments used a speech signal that was analog
band limited to 8 KHz and sampled at a rate of 20 KHz with
12 bits A/D. Successive 25.6 msec windows, corresponding to
512 data points, were fast Fourier transformed. Nonlinear
operations were applied to each data set, and the inverse
fast Fourier transforms were taken yielding 25.6 msec of
reconstructed speech signal. These signals were D/A
converted at a rate of 20 KHz and passed through an 8 KHz low
pass analog filter. Only rectangular windows were used and
no attempt was made to fit the windows together since the
amplitude of the reconstructed signal was unimportant. The
first two experiments are included for completeness only.
The latter two are the concern of this thesis.
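
The window-by-window procedure described above can be sketched as follows. This is an illustration under assumptions, not the code used in [Ref. 15]: the function names are hypothetical, a trailing partial window is simply dropped, and the unit-magnitude operation is shown only as one example of a nonlinear operation on the spectrum.

```python
import numpy as np

FRAME = 512  # 25.6 msec at the 20 kHz sampling rate quoted above

def process_frames(x, op, frame=FRAME):
    """Apply op to the spectrum of each non-overlapping rectangular window."""
    n_frames = len(x) // frame          # trailing partial window is dropped
    out = np.empty(n_frames * frame)
    for i in range(n_frames):
        seg = x[i * frame:(i + 1) * frame]
        spectrum = op(np.fft.fft(seg))
        out[i * frame:(i + 1) * frame] = np.real(np.fft.ifft(spectrum))
    return out

def unit_magnitude(spectrum):
    """Example nonlinear operation: keep the phase, set the magnitude to one."""
    return np.exp(1j * np.angle(spectrum))

x = np.random.default_rng(0).standard_normal(2000)  # stand-in for speech
short_term_phase_only = process_frames(x, unit_magnitude)
```

Passing an identity operation instead of `unit_magnitude` returns the input unchanged, which is a convenient check that the framing itself introduces no distortion.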


A. SHORT-TERM PHASE ONLY SIGNALS 
This experiment basically repeated the previously
mentioned work of Oppenheim, as the magnitude of the Fourier
transform of the data sets was set equal to one. The phase
was unchanged. The reconstructed short-term phase only
signal was found to retain many of the original waveform's
features. Listeners could identify speaker dependent
characteristics and the intelligibility, while not judged
good, was likened to a signal containing a lot of noise.
There was no attempt made by the researchers to clean up the
signal. The results of this experiment clearly are contrary
to Ohm's law and demonstrate that short-term phase only
speech is intelligible.


B. ANALYTIC SIGNAL PROCESSING
The second experiment was a repeat of one carried out in
the late 1940's. Here the representation is an infinitely
clipped version of the original signal

$$\hat{s}(t) = \mathrm{Sgn}[s(t)] \qquad (3.2)$$

where s(t) is the original signal, and Sgn is defined to be
the sign of s(t). Thus the continuous valued signal, s(t),
was transformed into a discrete valued signal. The
transformation retains only the real-zero information of
s(t). That is, if s(t) was an analytic signal the real-zeros
mark the time when the phase was changed by 180°. The
intelligibility of such a signal was not commented on by the
experimenters; however, they did say that large amounts of
speech information were retained using this transform.
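
Infinite clipping and its preservation of the real zeros can be sketched as follows. The two-tone test signal and the helper names are assumptions made for the example.

```python
import numpy as np

def infinitely_clip(s):
    """Keep only the sign of each sample."""
    return np.sign(s)

def zero_crossings(s):
    """Indices i where the sign changes between s[i] and s[i+1]."""
    signs = np.sign(s)
    return np.nonzero(signs[:-1] * signs[1:] < 0)[0]

t = np.arange(400) / 400.0
s = np.sin(2 * np.pi * 5 * t + 0.3) + 0.5 * np.sin(2 * np.pi * 12 * t)
clipped = infinitely_clip(s)
```

The clipped signal takes only discrete values, and its zero crossings fall at exactly the same sample positions as those of the original signal.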

C. PHASE CEPSTRUM

The concept of cepstral analysis of speech was developed 
by Oppenheim [Ref. 16] and is an example of a broad class of 
nonlinear processing called homomorphic processing. These 
homomorphic systems obey generalized laws of superposition. 
If $x_1(n)$ and $x_2(n)$ are inputs to a homomorphic system,
$y_1(n)$ and $y_2(n)$ are the corresponding outputs, and k is
any scalar, then

$$y_1(n) = \phi[x_1(n)]$$
$$y_2(n) = \phi[x_2(n)]$$
$$\phi[x_1(n) \,\triangle\, x_2(n)] = \phi[x_1(n)] \circ \phi[x_2(n)]$$
$$\phi[k \,\Box\, x_1(n)] = k * y_1(n)$$

where $\triangle$, $\Box$, $\circ$, and $*$ are mathematical
operations.

The importance of these homomorphic systems is that $\phi$
can be broken down into a cascade of operations as shown in
Figure 3.1, where $A$ and $A^{-1}$ are inverses of each other
and $L$ is a simple linear filter.

Figure 3.1. Homomorphic System

Thus Oppenheim [Ref. 17] formulated a model for the
production of speech as shown in Figure 3.2. The model is
based on the assumption that the excitation and vocal tract
parameters are independent. The source of excitation for
the voiced sounds is the impulse generator whose period is
controlled by the pitch-period signal. The impulse
generator produces an impulse once every $N_0$ samples, where
$N_0$ is the pitch period and $1/N_0$ is the pitch frequency.
The unvoiced excitation is from the random number generator
and simulates both fricative and plosive sounds. The digital
filter is assumed to be slowly varying with time and hence
changes its coefficients once every 10 msec. The amplitude
control simply adjusts the output level of the speech.

Figure 3.2. Model for Speech Production

Using this model the output digitized speech waveform
consists of the convolution of

(1) The train of impulses representing the pitch

(2) The excitation pulse

(3) The vocal tract impulse response.

If x(n) denotes the output signal, then

$$x(n) = [p(n) * e(n) * u(n)]\, w(n) \qquad (3.3)$$

where p(n) is the train of pitch pulses, e(n) is the
excitation pulse, u(n) the vocal tract impulse response, and
w(n) the window through which the speech is viewed. The
window w(n) is smooth, hence we can define

$$\hat{p}(n) = p(n)\, w(n) \qquad (3.4)$$

Substituting this into equation (3.3) it is possible to
approximate x(n) by

$$x(n) \approx \hat{p}(n) * e(n) * u(n) \qquad (3.5)$$

Examining equation (3.5) it is possible to convert the
triple convolution into a triple sum by first taking the
Fourier transform and then taking the logarithm. Processing
of this signal can be accomplished by a linear system and
recovery of the waveform can be made by passing the
processed signal through an exponentiator followed by an
inverse Fourier transformer. Thus a homomorphic system for
processing speech has been developed, as shown in Figure 3.3
[Ref. 18].

Figure 3.3. Homomorphic System for Processing Speech

Variations on this basic system have been developed to
estimate parameters of both the vocal tract transmission
functions and the excitation functions. One of these
variations involves making the assumption that the
excitation is $s(n) = \hat{p}(n) * e(n)$; then equation (3.5)
can be written as

$$x(n) \approx u(n) * s(n) \qquad (3.6)$$

The system to process signals given by equation (3.6) is
shown in Figure 3.4 [Ref. 19].

Referring to Figure 3.4, the signal at A is x(n) and the
signal at D is called the cepstrum of x(n) and equals the
cepstrum of the excitation plus the cepstrum of the vocal
tract impulse response.
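
The conversion of a triple convolution into a triple sum can be checked numerically. The sketch below is an illustration under assumptions: the three sequences are hypothetical and were chosen so that none of their spectra has zeros, and only the log magnitude is compared. A fully invertible homomorphic system would also need the phase, that is, the complex logarithm.

```python
import numpy as np

n0 = 50                                   # assumed pitch period in samples
p = np.zeros(2 * n0 + 1)
p[[0, n0, 2 * n0]] = [1.0, 0.5, 0.25]     # windowed pitch pulses (assumed weights)
e = np.array([1.0, 0.5])                  # assumed excitation pulse
k = np.arange(40)
u = 0.9 ** k * np.cos(0.3 * np.pi * k)    # assumed vocal tract impulse response

x = np.convolve(np.convolve(p, e), u)     # triple convolution
nfft = 256                                # >= len(x), so the DFT product is exact

def log_spectrum(seq):
    """Log magnitude of the zero-padded DFT."""
    return np.log(np.abs(np.fft.fft(seq, nfft)))

lhs = log_spectrum(x)
rhs = log_spectrum(p) + log_spectrum(e) + log_spectrum(u)
```

The arrays `lhs` and `rhs` agree to within rounding error, so after the transform and logarithm the three components are combined by addition and can be separated by a linear system.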





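[Block diagram: a data window at the input and a cepstrum
window at the output, with points A through D marking the
signal between stages]

Figure 3.4. Cepstral Processing of Speech
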
An important feature of the cepstrum at D is that it
separates the excitation from the vocal tract response. The
excitation is a sequence of quasi-periodic pulses, thus its
Fourier transform, at point B, is a line spectrum where the
lines are spaced at harmonics of the fundamental frequency.
The log magnitude operation does not affect the general
shape of the spectrum. The IDFT of the signal produces
another quasi-periodic waveform with pulses spaced at the
fundamental period. Thus the cepstrum of the excitation
should consist of pulses around n = 0, T, 2T, ..., where T
is the pitch period.

The DFT of the vocal tract response is a slowly varying
function of frequency. The log magnitude and IDFT yield a
sequence that is negligible after a few samples. The
cepstrum at D consists of two sequences, one which is
negligible after a few samples and one that is periodic.
Thus the cepstrum at D does differentiate the excitation
from the vocal tract parameters. The use of cepstral
processing has been extended into many diverse fields
[Ref. 20].
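
The separation of excitation and vocal tract in the cepstrum can be sketched with a synthetic voiced segment. The following is an illustration under assumptions: the pitch period of 80 samples, the decaying pulse weights (which act as a one-sided window and keep the spectrum free of zeros), the vocal tract response, and the search range are all hypothetical choices.

```python
import numpy as np

def real_cepstrum(x, nfft):
    """IDFT of the log magnitude of the DFT."""
    log_mag = np.log(np.abs(np.fft.fft(x, nfft)))
    return np.real(np.fft.ifft(log_mag))

pitch = 80                                # assumed pitch period in samples
p = np.zeros(7 * pitch + 1)
p[::pitch] = 0.8 ** np.arange(8)          # decaying pitch pulses (assumption)
k = np.arange(40)
u = 0.9 ** k * np.cos(0.3 * np.pi * k)    # assumed vocal tract impulse response
x = np.convolve(p, u)

c = real_cepstrum(x, 2048)
search = np.arange(20, 200)               # skip the low-time vocal tract part
estimated_pitch = int(search[np.argmax(c[search])])
```

The vocal tract contribution is concentrated in the first few cepstral samples, so the largest value beyond them falls at the pitch period.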

For their third experiment, Cox and Robinson [Ref. 21]
modified Figure 3.4 by setting the magnitude of the signal
at point C equal to one. Hence the cepstrum at point D is
due only to the phase of the signal at A. What amount of
information and intelligibility does this phase only
cepstrum contain? Surprisingly the cepstrum was judged to
be very intelligible by listeners and the noise level was
reduced when compared with the short-term phase only speech
(experiment number one).

D. INSTANTANEOUS PHASE OF THE ANALYTIC SIGNAL

The fourth experiment performed by Cox and Robinson
[Ref. 22] was first performed in 1955 by Marcou and Daguet
who were looking for more efficient modulation techniques.
They sought to use the analytic signal representation of a
real signal s(t). Given a real signal s(t), which is
Hilbert transformable, form a quadrature signal s*(t) and
construct

$$m(t) = s(t) + j\, s^*(t) \qquad (3.7)$$

From equation (3.7) it is possible to recover the original
signal as

$$s(t) = \mathrm{Re}[m(t)] = |m(t)| \cos\theta(t) \qquad (3.8)$$

Equation (3.8) lets the real signal, s(t), be represented by a
magnitude and phase.
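
The magnitude and phase representation of equations (3.7) and (3.8) can be sketched on sampled data. This is an illustration only: the quadrature component is obtained here by zeroing the negative-frequency half of the DFT, which is an assumption of the sketch rather than the method of the experimenters, and the amplitude-modulated test tone is hypothetical.

```python
import numpy as np

def analytic_signal(s):
    """Complex signal whose real part is s, built from the one-sided spectrum."""
    n = len(s)
    S = np.fft.fft(s)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(S * gain)

t = np.arange(1024) / 1024.0
s = (1.0 + 0.5 * np.cos(2 * np.pi * 4 * t)) * np.cos(2 * np.pi * 64 * t)
m = analytic_signal(s)
magnitude, theta = np.abs(m), np.angle(m)
reconstructed = magnitude * np.cos(theta)   # right-hand side of equation (3.8)
```

For this test tone the magnitude recovers the slowly varying envelope, the phase carries the 64-cycle carrier, and their combination returns the original signal.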

The concept of an analytic signal, as equation (3.7)
is called, was meaningless for discrete-time signals until
Rabiner and Schafer [Ref. 23] developed a complex
representation for real discrete-time bandpass signals.

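Following the notation of Rabiner and Schafer, given a
real sequence, x(n), with Fourier transform $X(\omega)$,
construct a complex sequence

$$\hat{x}(n) = x(n) + j\,\tilde{x}(n) \qquad (3.9)$$

the Fourier transform of which is

$$\hat{X}(\omega) = \begin{cases} 2X(\omega), & 0 < \omega < \pi \\ 0, & \pi < \omega < 2\pi \end{cases} \qquad (3.10)$$

From equation (3.9) the Fourier transform of $\hat{x}(n)$ is

$$\hat{X}(\omega) = X(\omega) + j\,\tilde{X}(\omega) \qquad (3.11)$$

Also from equation (3.10) it follows that

$$X(\omega) + j\,\tilde{X}(\omega) = 0, \qquad \pi < \omega < 2\pi$$

and

$$X(\omega) + j\,\tilde{X}(\omega) = 2X(\omega), \qquad 0 < \omega < \pi$$

These requirements are satisfied if

$$\tilde{X}(\omega) = H_d(\omega)\, X(\omega) \qquad (3.12)$$

where

$$H_d(\omega) = \begin{cases} -j, & 0 < \omega < \pi \\ +j, & \pi < \omega < 2\pi \end{cases} \qquad (3.13)$$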





Thus given any sequence x(n), it is possible to obtain the
sequence $\tilde{x}(n)$ by the linear filtering of x(n) with a
filter whose frequency response is given by equation (3.13).
Such a filter is called an ideal Hilbert transformer and
$\tilde{x}(n)$ is the Hilbert transform of x(n). The impulse
response of the ideal Hilbert transformer is

$$h_d(n) = \begin{cases} \dfrac{2}{\pi}\, \dfrac{\sin^2(\pi n/2)}{n}, & n \neq 0 \\ 0, & n = 0 \end{cases} \qquad (3.14)$$

Examining equation (3.14), the impulse response is
non-causal, of infinite duration, has odd symmetry, and all
even-numbered samples are equal to zero (i.e., $h_d(2n) = 0$,
n = 0, ±1, ±2, ±3, ...).
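
The properties listed for equation (3.14) can be checked on a finite range of samples. The sketch below evaluates the standard ideal Hilbert transformer impulse response, (2/π) sin²(πn/2)/n for n ≠ 0 and zero at n = 0; the function name and the range of n are choices made for the example.

```python
import numpy as np

def ideal_hilbert(n):
    """h_d(n) = (2/pi) * sin^2(pi*n/2) / n for n != 0, and 0 for n = 0."""
    n = np.asarray(n, dtype=float)
    h = np.zeros_like(n)
    nz = n != 0
    h[nz] = (2.0 / np.pi) * np.sin(np.pi * n[nz] / 2.0) ** 2 / n[nz]
    return h

n = np.arange(-8, 9)
h = ideal_hilbert(n)
```

Over this range the even-numbered samples vanish, the odd-numbered samples equal 2/(πn), and the sequence is odd about n = 0.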

Since infinite length, non-causal impulse responses are
not realizable, an FIR approximation is required. Given a
causal FIR system whose impulse response is h(n),
0 ≤ n ≤ N-1, its frequency response is given by

$$H(e^{j\omega}) = \sum_{n=0}^{N-1} h(n)\, e^{-j\omega n} \qquad (3.15)$$

Equation (3.13) says the desired frequency response,
$H_d(\omega)$, is purely imaginary. Thus, apart from a delay
factor, the real part of equation (3.15) must equal zero, as
h(n) is real. In order for this to hold h(n) must satisfy
the symmetry condition

$$h(n) = -h(N-1-n), \qquad n = 0, 1, \ldots, N-1 \qquad (3.16)$$


If N is odd, h(n) has odd symmetry about n = (N-1)/2. If N is
even, h(n) has odd symmetry about a point halfway between the
samples at n = (N/2) - 1 and n = N/2. If equation (3.16) is
satisfied, equation (3.15) can be written as

$$H(e^{j\omega}) = e^{-j\omega(N-1)/2}\, [\, j H^*(\omega)\, ] \qquad (3.17)$$

where $H^*(\omega)$ is a real function of $\omega$. If N is odd,
$H^*(\omega)$ can be written as

$$H^*(\omega) = \sum_{n=1}^{(N-1)/2} a(n) \sin(\omega n) \qquad (3.18)$$

where

$$a(n) = 2h\!\left(\frac{N-1}{2} - n\right), \qquad n = 1, \ldots, \frac{N-1}{2} \qquad (3.19)$$

Also for N odd,

$$h\!\left(\frac{N-1}{2}\right) = 0 \qquad (3.20)$$

For N even, equation (3.18) becomes

$$H^*(\omega) = \sum_{n=1}^{N/2} b(n) \sin[\omega(n - 1/2)] \qquad (3.21)$$

where $b(n) = 2h(N/2 - n)$, n = 1, ..., N/2.
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Examining equation (3.17) more closely, we find that the 


Factor eJUiN-1) 72 


is a delay of (N-1)/2 samples. 
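
The factorization of an antisymmetric FIR response into a pure delay times a real amplitude function can be confirmed numerically. The sketch below compares a direct evaluation of equation (3.15) with the factored form of equations (3.17) through (3.19) for odd N; the 7-tap impulse response is a made-up example, not one of the filters used in this work.

```python
import cmath
import math

def freq_response(h, w):
    """Direct evaluation of equation (3.15)."""
    return sum(hn * cmath.exp(-1j * w * n) for n, hn in enumerate(h))

def h_star(h, w):
    """Real amplitude function of equation (3.18), with a(n) from (3.19); N odd."""
    m = (len(h) - 1) // 2
    return sum(2.0 * h[m - n] * math.sin(w * n) for n in range(1, m + 1))

# Hypothetical 7-tap impulse response satisfying h(n) = -h(N-1-n).
h = [0.1, -0.25, 0.6, 0.0, -0.6, 0.25, -0.1]
m = (len(h) - 1) // 2

max_diff = 0.0
for k in range(1, 50):
    w = math.pi * k / 50
    direct = freq_response(h, w)
    factored = cmath.exp(-1j * w * m) * 1j * h_star(h, w)
    max_diff = max(max_diff, abs(direct - factored))
print(max_diff)
```

The two evaluations should agree to within floating-point rounding at every test frequency.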
In finding an approximation to the ideal Hilbert
transformer, the coefficients a(n) and b(n) were chosen in such
a fashion that $jH^{*}(\omega)$ approximates the ideal frequency response
given by equation (3.13). Thus $H^{*}(\omega)$ must approximate

$$D(\omega) = \begin{cases} -1, & 2\pi F_L \le \omega \le 2\pi F_H \\ +1, & 2\pi(1 - F_H) \le \omega \le 2\pi(1 - F_L) \end{cases} \tag{3.22}$$

where $F_L$ and $F_H$ are the lower and upper cutoff frequencies
represented as fractions of 2π. From equations (3.18) and
(3.21), $H^{*}(\omega)$ must equal zero at ω = 0 and ω = π when N is odd,
and must equal zero at ω = 0 when N is even.

For the ideal transformer the impulse response was zero
for all even-numbered samples and the frequency response was
imaginary, odd, periodic and

$$H_d(\omega) = H_d(\pi - \omega).$$

For the FIR approximation similar properties must be
valid. If N is odd and $F_L = 0.5 - F_H$, and assuming that

$$H^{*}(\omega) = H^{*}(\pi - \omega) \tag{3.23}$$

then substituting into equation (3.18) yields

$$\sum_{n=1}^{(N-1)/2} a(n)\,\sin(n\omega) = \sum_{n=1}^{(N-1)/2} a(n)\,\sin[(\pi - \omega)n] = \sum_{n=1}^{(N-1)/2} a(n)\,(-1)^{n+1}\sin(\omega n)$$

Rearranging terms,

$$\sum_{n=1}^{(N-1)/2} a(n)\,\sin(\omega n)\left[1 - (-1)^{n+1}\right] = 0$$

Thus

$$a(n) = \begin{cases} 0, & n \text{ even} \\ \text{unconstrained}, & n \text{ odd.} \end{cases}$$

Combining this result with equations (3.16), (3.19), and
(3.20), we have that when (N-1)/2 is even, h(n) = 0 for n = 0, 2,
4, ..., and when (N-1)/2 is odd, h(n) = 0 for n = 1, 3, 5, ....
For the case of N even, no such relationship among the
coefficients exists.

One important difference between even- and odd-length
impulse responses can be seen in direct convolution. The
convolution summation given by

$$\hat{x}(n) = \sum_{k=0}^{N-1} h(k)\,x(n-k)$$

involves only (N+1)/4 multiplications per output sample for N
odd and N/2 multiplications for N even. The saving occurs
because alternate values of h(n) are zero for N odd. Because of
this saving, and for technical considerations, only Hilbert
transformers of odd length are used.
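
The multiplication count quoted above can be illustrated with a short sketch. It assumes an odd-length filter whose odd-numbered taps are zero, pairs each remaining tap with its mirror image through equation (3.16), and counts the multiplications for one output sample. The 7-tap filter and the input sequence are invented for the illustration, and the function names are not from the original programs.

```python
def hilbert_fir_folded(h, x, n):
    """One output sample using h(k) = -h(N-1-k) and skipping zero taps.

    Returns (output sample, number of multiplications performed).
    Input samples outside 0..len(x)-1 are treated as zero.
    """
    big_n = len(h)

    def sample(i):
        return x[i] if 0 <= i < len(x) else 0.0

    acc, mults = 0.0, 0
    for k in range(big_n // 2):          # the middle tap is zero by (3.20)
        if h[k] == 0.0:
            continue                     # alternate taps vanish
        acc += h[k] * (sample(n - k) - sample(n - (big_n - 1 - k)))
        mults += 1
    return acc, mults

def direct_output(h, x, n):
    """Plain convolution sum, used here only as a reference."""
    return sum(hk * x[n - k] for k, hk in enumerate(h) if 0 <= n - k < len(x))

# Hypothetical N = 7 filter: antisymmetric, odd-numbered taps equal to zero.
h = [0.2, 0.0, 0.6, 0.0, -0.6, 0.0, -0.2]
x = [float((3 * i) % 7 - 3) for i in range(12)]

folded, mults = hilbert_fir_folded(h, x, 8)
reference = direct_output(h, x, 8)
print(folded, reference, mults)
```

For N = 7 the folded form uses (N+1)/4 = 2 multiplications; for the 79-weight filter used later the same count is 20.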

In determining the values of h(n), Rabiner and Schafer
[Ref. 24] used the Remez algorithm for the design of optimal
FIR filters. The values of h(n) were calculated to minimize
the peak approximation error, which is given by

$$\delta = \max_{2\pi F_L \le \omega \le 2\pi F_H} \left| D(\omega) - H^{*}(\omega) \right| \tag{3.24}$$

The Remez algorithm gives a Chebyshev, or equiripple,
approximation to the desired response. Hence the error
function is equiripple over the range 2πF_L ≤ ω ≤ 2πF_H.
Given N, F_L and F_H, the resulting approximation is best in
the minimax sense.
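
To make the error measure of equation (3.24) concrete, the sketch below evaluates it on a frequency grid. Note the substitution: the weights here come from a Hamming-windowed truncation of the ideal response, not from the Remez algorithm, so the result is not minimax and will not match the published weights. The band edges 0.05 and 0.45, the grid size, and all function names are assumptions made for the illustration.

```python
import math

def windowed_hilbert(num_taps):
    """Hamming-windowed truncation of equation (3.14), delayed by (N-1)/2.

    A stand-in design for illustration, not the Remez (minimax) design.
    """
    m = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - m
        ideal = 0.0 if k % 2 == 0 else 2.0 / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def h_star(h, w):
    """Amplitude function of equation (3.18) for odd N."""
    m = (len(h) - 1) // 2
    return sum(2.0 * h[m - n] * math.sin(w * n) for n in range(1, m + 1))

def peak_error(h, f_low, f_high, grid=400):
    """Grid estimate of equation (3.24), with D(w) = -1 on the band."""
    worst = 0.0
    for i in range(grid + 1):
        w = 2.0 * math.pi * (f_low + (f_high - f_low) * i / grid)
        worst = max(worst, abs(-1.0 - h_star(h, w)))
    return worst

h = windowed_hilbert(79)
delta = peak_error(h, 0.05, 0.45)
print(delta)
```

A minimax design of the same length and band edges would, by definition, give a peak error no larger than the value printed here.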

Using this concept of an analytic signal representation
for discrete-time signals, Cox and Robinson [Ref. 25] formed
the analytic phase representation of a speech signal. Given
a sampled speech signal, s(n), they calculated the Hilbert
transform, $\hat{s}(n)$, by the use of a 79-weight Hilbert
transformer. Thus, having the analytic signal

$$m(n) = s(n) + j\,\hat{s}(n)$$

the original signal s(n) is given by

$$s(n) = |m(n)|\,\cos\theta(n).$$


The analytic phase representation is given by cos θ(n). Thus,
by way of a mathematical artifice, a real-valued sequence s(n)
is represented as having magnitude and phase, with only the
phase being retained. Contrary to common sense, perhaps, this
analytic phase representation of speech was found to be
intelligible. While these experiments by themselves do not
prove that phase is a physical invariant of speech, they do
indicate that more research is needed to determine what
role phase plays in speech intelligibility.
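
The analytic phase representation can be sketched in a few lines. The code below forms cos θ(n) = s(n)/|m(n)| from a real sequence; it assumes a Hamming-windowed 79-weight Hilbert transformer as a stand-in for the minimax weights, and it uses a synthetic amplitude-modulated cosine instead of speech. The function names and the test signal are illustrative only.

```python
import math

def hilbert_taps(num_taps=79):
    """Hamming-windowed ideal Hilbert response; a stand-in for the minimax weights."""
    m = (num_taps - 1) // 2
    taps = []
    for n in range(num_taps):
        k = n - m
        ideal = 0.0 if k % 2 == 0 else 2.0 / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def analytic_phase(s, h):
    """Return cos(theta(n)) = s(n) / |m(n)|, with m(n) = s(n) + j*s_hat(n)."""
    m = (len(h) - 1) // 2
    phase = []
    for n in range(len(s)):
        # Evaluate the filter output at n + m to undo its (N-1)/2-sample delay.
        s_hat = sum(h[k] * s[n + m - k]
                    for k in range(len(h)) if 0 <= n + m - k < len(s))
        magnitude = math.hypot(s[n], s_hat)
        phase.append(s[n] / magnitude if magnitude > 0.0 else 0.0)
    return phase

# Hypothetical test signal: slowly varying envelope on a 0.1 cycles/sample carrier.
carrier = [math.cos(2.0 * math.pi * 0.1 * n) for n in range(400)]
envelope = [1.0 + 0.5 * math.cos(2.0 * math.pi * 0.005 * n) for n in range(400)]
s = [a * c for a, c in zip(envelope, carrier)]

cos_theta = analytic_phase(s, hilbert_taps())
worst = max(abs(cos_theta[n] - carrier[n]) for n in range(50, 350))
print(worst)
```

Away from the ends of the record, the phase-only sequence closely follows the unit-amplitude carrier: the envelope has been discarded and only cos θ(n) remains.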

As was mentioned, a 79-weight Hilbert transformer was
used in obtaining the analytic signal. Rabiner and Schafer
[Ref. 26] calculated weights for three different values of
peak approximation error and cutoff frequency. Table 3.1
lists these weights, and Figures 3.5 through 3.7 are plots of
the magnitude of the frequency response. Table 3.1 lists only
the even-numbered weights: for N = 79, (N-1)/2 = 39 is odd, so
all odd-numbered weights are zero, and the weights have odd
symmetry about n = 39.


IV. EXPERIMENTAL PROCEDURE


This thesis extends the work of Cox and Robinson to the
isolated word recognition field. Specifically, using the
homomorphic and analytic-signal processing techniques
employed in experiments three and four, an isolated word
recognition system is developed.


A. DATA ACQUISITION 

In order to form a data base for use by the system 
twenty volunteers were recruited to record the digits zero 
through nine. Each participant was given a questionnaire/ 
instruction sheet like that contained in Appendix A. All 
speakers were males between the ages of 25 and 35 and all 
were native English speakers. Their places of birth varied 
from eastern Pennsylvania to southern Tennessee. Ten of 
these speakers were selected to form the data base or 
pattern base of the system. The other ten speakers were 
used to test the system. 

The speech was recorded on an analog tape recorder with 
all recordings being done in the Speech Processing 
Laboratory. The recordings were done in the late afternoon 
or in the evening when the ambient noise level was at a 
minimum. The tape recorder used was the HP-3964A reel-to- 
reel instrumentation recorder running at 7.5 ips using AMPEX 


professional audio tape.

Before this analog speech could be digitized, an
appropriate bandwidth and sampling rate had to be
determined. The power spectral density of each digit was
computed and averaged over ten utterances of the digit. The
majority of the power was found to be below 3 kHz, except in
the case of the number 'six', where non-negligible power was
found at frequencies up to 6 kHz. A cutoff frequency of 4
kHz was chosen, which is exactly half the bandwidth that Cox
and Robinson used. As will be explained later, once the
bandwidth is fixed the sampling rate is also fixed. In this
case the sampling rate is fixed at approximately 10 kHz
(10,240 samples/sec).

The machine used to digitize the speech was the GENRAD
2505 Signal Analysis System [Ref. 27]. The system is a
narrowband (0 - 25 kHz) signal analysis system originally
designed for vibrational analysis studies. The system uses
a DEC PDP 11/34A as the host computer and supports two
channels of A/D conversion.

The heart of the system, softwarewise, is GENRAD's Time
Series Language (TSL), which allows the operator to control
the A/D converter. TSL is an interpretive language which
uses commands similar to BASIC. The TSL program 'ANADSK' is
the routine that provides analog input to disk storage.
Given a bandwidth, the 'ANADSK' routine sets the sampling
rate at 2.56 times the highest frequency component to
prevent aliasing. The system provides for high-speed
continuous sampling and writes the digitized data to the
system's Winchester disks in 2048-byte blocks.

The two-channel A/D converter uses, as anti-aliasing filters
on each channel, two 6-pole Chebyshev filters in cascade, each
with 96 dB/octave rolloff above cutoff. The A/D converter is
a 2 µsec converter with a 12-bit output.

Once the speech was digitized a time window for the 
sampled data had to be determined. Referring again to the 
utterances whose power spectral densities were computed, the 
average length of the utterances was 740 msec. In order for 
the mathematics to work out nicely a 750-msec window was 
chosen. 

Using the TSL library routines 'REIO' and 'KDISPL', a routine
was written that displayed the digitized data on the
system's CRT. The program graphically displayed 1024
samples at a time and allowed the operator to select any 256
samples for transfer to the W. R. Church Computer Center's
IBM 3033 for processing. This transfer was via a 1200 baud
modem. With the capability to view the data prior to
transfer, the start of the utterance could be selected to
within 128 samples. Since the time window was selected to
be 750 msec and the speech was sampled at 10,240 samples/sec,
7680 points needed to be transferred. Thus thirty
blocks of 256 samples each were transferred per utterance.

The transfer/interface program between the Speech Lab's
PDP 11/34A and the IBM 3033 was written by LT Jay H. Benson.
A copy of his program, 'CATCH', is included in Appendix B.
The transfer of data via the modem was very time consuming
because, for technical reasons, each sample, which occupied
two bytes on the PDP 11/34A, was made into a four-byte number
for transfer. The sixteen most significant bits were then
masked off prior to storage on the IBM system. In order to
minimize the amount of disk storage required, the data was
written to the disk using an unformatted FORTRAN write
statement, using INTEGER*2 numbers. Even using this
scheme to maximize storage efficiency, 24 cylinders plus
magnetic tape backup were required to store the data.
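
The figures quoted in this section follow from simple arithmetic, which the sketch below reproduces. The constant names are illustrative; the values themselves are those stated in the text (4 kHz bandwidth, the 2.56 rate multiplier, a 750 msec window, 256-sample blocks, two-byte storage).

```python
BANDWIDTH_HZ = 4000      # cutoff frequency chosen for the speech
RATE_FACTOR = 2.56       # sampling-rate multiplier applied by 'ANADSK'
WINDOW_SEC = 0.750       # time window per utterance
BLOCK_SAMPLES = 256      # samples per transferred block
BYTES_PER_SAMPLE = 2     # INTEGER*2 storage

sample_rate = round(RATE_FACTOR * BANDWIDTH_HZ)   # samples/sec
points = round(WINDOW_SEC * sample_rate)          # samples per utterance
blocks = points // BLOCK_SAMPLES                  # blocks per utterance
bits_per_pattern = points * BYTES_PER_SAMPLE * 8  # storage per stored pattern
block_msec = 1000.0 * BLOCK_SAMPLES / sample_rate # duration of one block

print(sample_rate, points, blocks, bits_per_pattern, block_msec)
```

One 256-sample block spans 25 msec at this rate, and a full 7680-sample pattern occupies 122,880 bits when stored as two-byte integers.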


B. DATA PROCESSING

The decision to use the IBM system to process the data
was based on the availability of library routines (e.g.,
IMSL, NONIMSL), the DISSPLA graphics package, and the full
screen text editor. All programs in Appendix B were written
in FORTRAN H.

The first task was to compute an average waveform for
each speaker and digit. In order to accomplish this, three of
the four utterances of each digit by each of the 10 speakers
were averaged together. The program 'MEANS' was used to
compute this average. The technique is very simple and
straightforward, as the ensemble mean was computed. This
agrees with the work done by the Air Force [Ref. 28], where
they assumed that the samples are statistically independent,
identically distributed Gaussian random variables. This is an
oversimplification, as it is known that the vocal tract is
slowly varying, with the tract parameters changing only every
10 msec.
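
The ensemble mean described above is a sample-by-sample average across repetitions. A minimal sketch follows; it is not the 'MEANS' program itself, and the three short sequences stand in for 7680-sample utterances.

```python
def ensemble_mean(utterances):
    """Sample-by-sample mean of equal-length utterances."""
    if not utterances:
        raise ValueError("need at least one utterance")
    length = len(utterances[0])
    if any(len(u) != length for u in utterances):
        raise ValueError("utterances must have equal length")
    count = len(utterances)
    return [sum(u[n] for u in utterances) / count for n in range(length)]

# Three hypothetical 4-sample utterances of the same digit by one speaker.
takes = [[3, 0, -3, 6], [0, 3, -6, 6], [3, 3, -3, 6]]
average = ensemble_mean(takes)
print(average)
```

Because the average is taken at fixed sample positions, any misalignment between repetitions smears the result, which is why the start of each utterance had to be selected by hand before transfer.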

The short-term cepstral representation of the averaged
waveform was computed using the program 'CEP'. In keeping
with Cox and Robinson, the waveform was segmented into
25 msec parts and each part was processed in sequence.

Finally, the analytic signal representation of the
waveform was computed using an FIR Hilbert transformer with
79 weights and a lower cutoff frequency of 0.05. The
frequency response of this filter is shown in Figure 3.3.
This particular filter was chosen over the other two
79-weight filters because of its very small approximation
error. The small approximation error does imply that the
transition band of this filter is larger than that of the
other two filters; however, this was deemed less important
than the peak approximation error.

Examples of these three representations of the same
utterances can be found in Figures 4.1 through 4.30. These
examples are of a 30-year-old male, born and raised in
eastern Pennsylvania, who is a Naval cryptologic officer. In
order to display all 7680 points on one graph, the waveform
was first normalized, then divided into four 1920-point
parts. Each part was biased by (N-1) * 2, where N = 1, 2,
3, 4, to permit graphing the four segments on one page.
The graphs should be read from left to right, top to bottom.
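
The display preparation just described can be sketched as follows. Two details are assumptions, since the text does not state them: normalization is taken to mean scaling by the peak absolute value, and the bias is added (not subtracted). The function name and the synthetic waveform are illustrative.

```python
import math

def stack_segments(waveform, parts=4, spacing=2.0):
    """Normalize to unit peak, split into equal parts, and bias part N by (N-1)*spacing."""
    peak = max(abs(v) for v in waveform)
    scale = 1.0 / peak if peak > 0.0 else 1.0   # assumed: peak normalization
    size = len(waveform) // parts
    stacked = []
    for part in range(1, parts + 1):
        bias = (part - 1) * spacing             # assumed: bias is added
        chunk = waveform[(part - 1) * size: part * size]
        stacked.append([v * scale + bias for v in chunk])
    return stacked

# Hypothetical 7680-point waveform standing in for one utterance.
waveform = [3.0 * math.sin(0.01 * i) for i in range(7680)]
stacked = stack_segments(waveform)
print([len(part) for part in stacked])
```

With a spacing of 2 and a unit-peak waveform, the four traces occupy adjacent, non-overlapping bands and can be drawn on a single page.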


C. DECISION ALGORITHM

Once the speech had been processed, a decision algorithm
had to be formulated to classify utterances based on the
patterns collected. All of the isolated word recognizers
use a form of classical pattern recognition to classify
utterances. The VRM system uses a nearest neighbor
algorithm with a variable threshold. If no utterance is
within the distance specified by the threshold, an 'unable to
classify' message is issued.

The nearest neighbor rule is a special case (k = 1) of the
pooled form of the k-nearest neighbor rule [Ref. 29]. For the
two-class case, a hypersphere is formed around the vector y to
include k total samples regardless of their class. Thus
$k_1 + k_2 = k$, where $k_i$ equals the number of vectors belonging
to class i. The quotient $k_1/k_2$ is formed and compared to
one. If $k_1/k_2 > 1$, there are more class one vectors in the
hypersphere around y, and the vector y is said to belong to
class one. If the converse of the inequality is true,
$k_1/k_2 < 1$, then y is said to belong to class two. For large
sample sets, the probability of error for the case k = 1 is
less than twice the minimum probability of error for any
decision rule.
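
The pooled two-class rule can be written out directly. The sketch below is an illustration with invented two-dimensional points; it compares the counts k1 and k2 instead of forming the quotient, which gives the same decision while avoiding division by zero, and it returns 0 on a tie (a convention chosen here, not taken from the text).

```python
import math

def knn_two_class(y, class_one, class_two, k):
    """Pooled k-nearest-neighbor rule for two classes; returns 1, 2, or 0 on a tie."""
    pooled = [(math.dist(y, v), 1) for v in class_one]
    pooled += [(math.dist(y, v), 2) for v in class_two]
    pooled.sort(key=lambda pair: pair[0])
    nearest = pooled[:k]                      # the k samples inside the hypersphere
    k1 = sum(1 for _, label in nearest if label == 1)
    k2 = len(nearest) - k1
    if k1 > k2:
        return 1
    if k2 > k1:
        return 2
    return 0

# Hypothetical training vectors for the two classes.
class_one = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
class_two = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]

print(knn_two_class((0.5, 0.5), class_one, class_two, k=3))
print(knn_two_class((5.2, 5.1), class_one, class_two, k=3))
```

Setting k = 1 reduces this to the ordinary nearest neighbor rule.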
The nearest neighbor rule was employed to classify the
utterances. Using the program 'DEC', the Euclidean distance
between a test vector and the stored patterns was computed.
The results of this pattern matching are discussed in the
next chapter.
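
A minimal sketch of Euclidean nearest-neighbor matching against stored patterns is given below. It is not the 'DEC' program: the function name, the three-element vectors, and the optional rejection threshold (modeled on the VRM behavior described earlier) are all assumptions made for the illustration.

```python
import math

def nearest_pattern(test_vector, patterns, threshold=None):
    """Return (label, distance) of the closest stored pattern.

    patterns is a list of (label, vector) pairs. If a threshold is given and
    the best distance exceeds it, the label is None (unable to classify).
    """
    best_label, best_dist = None, math.inf
    for label, vector in patterns:
        d = math.dist(test_vector, vector)   # Euclidean distance
        if d < best_dist:
            best_label, best_dist = label, d
    if threshold is not None and best_dist > threshold:
        return None, best_dist
    return best_label, best_dist

# Hypothetical stored patterns; real patterns hold 7680 samples each.
patterns = [("zero", [0.0, 1.0, 0.0]), ("one", [1.0, 0.0, 0.0])]

label, distance = nearest_pattern([0.9, 0.1, 0.0], patterns)
rejected, _ = nearest_pattern([0.9, 0.1, 0.0], patterns, threshold=0.05)
print(label, distance, rejected)
```

Keeping the distance alongside the label makes it possible to compare the winning distance with the others, as is done when the results are examined.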
Figure 4.1. Sampled Waveform, Zero
Figure 4.2. Analytic Representation of Zero
Figure 4.3. Cepstral Representation of Zero
Figure 4.4. Sampled Waveform, One
Figure 4.5. Analytic Representation of One
Figure 4.6. Cepstral Representation of One
Figure 4.7. Sampled Waveform, Two
Figure 4.8. Analytic Representation of Two
Figure 4.9. Cepstral Representation of Two
Figure 4.10. Sampled Waveform, Three
Figure 4.11. Analytic Representation of Three
Figure 4.12. Cepstral Representation of Three
Figure 4.13. Sampled Waveform, Four
Figure 4.14. Analytic Representation of Four
Figure 4.15. Cepstral Representation of Four
Figure 4.16. Sampled Waveform, Five
Figure 4.17. Analytic Representation of Five
Figure 4.18. Cepstral Representation of Five
Figure 4.19. Sampled Waveform, Six
Figure 4.20. Analytic Representation of Six
Figure 4.21. Cepstral Representation of Six
Figure 4.22. Sampled Waveform, Seven
Figure 4.23. Analytic Representation of Seven
Figure 4.24. Cepstral Representation of Seven
Figure 4.25. Sampled Waveform, Eight
[Plot axis label: SAMPLE NUMBERS (K)]
Figure 4.26. Analytic Representation of Eight
Figure 4.27. Cepstral Representation of Eight
Figure 4.28. Sampled Waveform, Nine
Figure 4.29. Analytic Representation of Nine
Figure 4.30. Cepstral Representation of Nine
V. RESULTS AND CONCLUSIONS


Ten speakers were selected to form the data base for the
system. Their utterances were processed to obtain both
their cepstral and analytic phase representations. The
system was then tested using two groups of speakers. The
first group, denoted Group A, consisted of speakers whose
utterances were used to form the data base. Each speaker
repeated the digits four times, and only three of these
utterances were used to compute the average waveform and
hence the cepstral and analytic phase representations.
Group A can be thought of as having trained the system. The
second group, Group B, consisted of the other ten speakers.

The system was tested using ten utterances per digit
from each of the two groups of speakers. The reference
pattern space was varied, using three different spaces each
containing 100 patterns. The cepstral and analytic
representations formed two of the reference spaces, while
the unprocessed signals formed the third space. Tables 5.1
and 5.2 contain the results of the test.

The results for Group A, in all categories, are below
the results attainable with the VRM system. For three
training passes the VRM system has a 97% recognition rate.

The high percentage of recognition for the unprocessed
waveforms was to be expected, since the speakers trained the
system and the pattern space did consist of the average of
each speaker's utterances. The distances between the
pattern vectors and the test vector were of the same
magnitude for the unprocessed waveforms, regardless of
whether the utterance was correctly identified or not. In
the case of the short-term phase representations, when the
system correctly identified an utterance, the distance
between the test vector and its nearest neighbor was an
order of magnitude less than all the other distances. When
the system incorrectly identified an utterance, all distances
were of the same magnitude.

The success demonstrated in the speaker-dependent case
was not without cost. As compared to the VRM system, which
uses at most 120 bits/pattern, this system has 122.8K
bits/pattern (7680 two-byte numbers). There was an
extensive amount of manual editing involved to obtain these
patterns, on the order of ten minutes per utterance.
However, it was shown that short-term phase-only speech can
be used to construct a speaker-dependent isolated word
recognizer.

The results for Group B appear to be abysmal; however,
several things must be considered. First, there was no
preprocessing of the signals to time-warp them. Second, no
features were extracted; only the entire waveforms were
used. Third, the decision algorithm may have to be tailored
to fit the data, rather than using a general purpose
decision rule. Last, but certainly not least, no system
exists today that is completely speaker independent.

One final observation concerning Group B: when the
decision algorithm incorrectly identified an utterance, it
did so with a great deal of bias. In 30% of the cases where
an utterance was incorrectly identified, the number 'one' was
picked to be the nearest neighbor.

This thesis was not an attempt to definitively answer
the question, "Is phase a physical invariant of speech?"
Its purpose was to show that phase should be considered when
constructing a word recognition system. This was
accomplished. The next step is to use the information obtained
from the phase in conjunction with other word recognition
systems to improve these systems, with the long
range goal of solving the speaker-independent word
recognition problem.
TABLE 5.1

GROUP A RECOGNITION RESULTS
BASED ON TEN UTTERANCES PER DIGIT

          Unprocessed    Cepstral          Analytic
Digits    Waveforms      Representation    Representation
  0           9              7                 6
  1          10              8                 4
  2          10              6                 4
  3           9              5                 3
  4          10              5                 ?
  5          10              ?                 5
  6           8              4                 5
  7           9              3                 ?
  8          10              6                 ?
  9          10              ?                 6
 AVG          9.5            5.9               4.9
TABLE 5.2

GROUP B RECOGNITION RESULTS
BASED ON TEN UTTERANCES PER DIGIT

          Unprocessed    Cepstral          Analytic
Digits    Waveforms      Representation    Representation
  0           2              0                 1
  1           4              3                 3
  2           1              ?                 0
  3           2              0                 0
  4           1              0                 1
  5           1              0                 0
  6           0              0                 0
  7           1              0                 0
  8           0              1                 0
  9           2              1                 0
 AVG          1.4            ?                 0.5
APPENDIX A

INSTRUCTION SHEET

Thank you for participating in the Speech Processing
Lab's effort to collect speech samples. This exercise will
require about 10 minutes of your time to complete.


I. Biographical Data

A. Name:

B. Age:

C. Sex:

D. Place of Birth:

E. [illegible]:


II. Speech Sampling

A. Repeat each word on the list four times, pausing
approximately 5 sec. between utterances. (For example: the
first word on the list is 'zero', therefore you would say:
'zero' (pause) 'zero' (pause) 'zero' (pause) 'zero' (pause)
'one' (pause) .....)

zero     six
one      seven
two      eight
three    nine
four
five

B. Repeat the following exercise 3 times:

Read the entire list of numbers at your natural speaking
rate, pausing approx. 5 secs. before repeating the list. Do
not pause unnaturally between the numbers. We are looking
for continuous speech such as in a conversation.

zero-one-two-three-four-five-six-seven-eight-nine (pause/repeat)
APPENDIX B

COMPUTER PROGRAMS

All programs were written in IBM FORTRAN H to run on the
W. R. Church Computer Center's IBM 3033. The programs
access routines from the IMSL library. The graphics
programs interact with the DISSPLA graphics package.